ABSTRACT

Inspired by four recent decisions to change
achievement tests used in the Austin Independent School District, the
separate forms used and procedures followed have been combined into a
systematic approach intended for use in future achievement test
selections. A rating scale (Attachment 1) was developed to expedite a
systematic comparison among possible achievement tests, and to allow
a weighting of the factors to be rated according to the school
system's needs. Five groups of experts (parents, teachers and
principals, testing staff, central administration, and the board of
trustees) have varying responsibility for rating the five factors
critical to making the best choice: technical soundness; logistical
feasibility; instructional validity; financial affordability; and
interpretational ease. The Fatal Flaw Principle (occurring when an
essential factor is rated unacceptable) can eliminate a test
outright, and the Shoo-In Principle (occurring when a clearly
superior rating is given on a critical factor) will select a single
test outright. An outline of eight procedural steps for the selection
process and the contexts in which they are appropriate is attached.
(BS)

SIZING UP CANDIDATES FOR A NEW ACHIEVEMENT TEST

A major decision for a school system is the selection of
a new achievement test. However, in many systems, the people
charged with making that decision are doing so for the first
time. Even those of us who have done it before are somewhat
fuzzy about the best process to follow. What should be
considered? This paper concentrates on what we as educators,
administrators, and researchers should consider to get the most
for our money and effort when choosing a new achievement test.
The experiences from four major test changes in the Austin Public
Schools, Austin, Texas, provided the base from which the forms and
procedures described arose.

The Austin Public Schools have changed achievement tests
at various grade levels several times recently. Each time the
circumstances required a different approach to the decision. The
factors which are critical to the school system somehow do not
coincide exactly with the promotional information provided by the
test publishers. Moreover, the factors important to the
testing office do not coincide exactly with those important to
other areas of the school system. The testing office, campus
staff, and parents each have perspectives which make them
the most appropriate ones to judge the merits of tests
on different characteristics. Over the years, we have identified
the factors which must be considered to represent the needs of
the entire school system in the test-selection process.

In 1980, a new achievement test was introduced for grades
1 through 8; in 1982, for kindergarten; in 1982, for high school
graduation competencies; in 1983, for grades 9 through 12.
As each test has been changed, we have developed instruments
(description forms, rating forms, and summary forms). After the
new high school test was selected in 1983, the separate forms and
procedures followed were combined into a unified system or
approach to use in selecting future achievement tests.

A systematic approach to the selection of an achievement
test can help ensure that the best decision is made, the one that
returns the most useful information for the investment of money,
effort, and time. To aid ourselves and others when the time
comes to select a new achievement test, this paper includes
information and rating forms designed around the five factors
we have found to be most critical to making the best choice.
These five factors are:

1. Technical soundness,

2. Logistical feasibility,

3. Instructional validity,

4. Financial affordability, and

5. Interpretational ease.

Two principles will also be described as playing
important roles in the selection process. These are:

1. The Fatal Flaw Principle and

2. The Shoo-In Principle.

The Fatal Flaw Principle comes into play when one
of the five rating factors is essential for a new test. In this
situation, any candidate that rates at an unacceptable level on
that critical, essential factor cannot be adopted; thus, a fatal
flaw exists.

The Shoo-In Principle is less precise, but a test that
rates as clearly the superior candidate on a single factor that is
critical and essential becomes a shoo-in in the absence of a
fatal flaw.



WHO ARE THE EXPERTS? 

As mentioned earlier, the perspectives and priorities of
various groups of people differ when judging achievement test
candidates. Figure 1 outlines the areas in which each of five groups
appears to be the most appropriate to accept the responsibility
for ensuring that the correct choice is made. This is not to say
that all groups are not to consider all factors, but the group(s)
with the major responsibility for each factor is (are) indicated.



WHAT ARE THE STEPS FOR SELECTING AN ACHIEVEMENT TEST?

Figure 2 outlines the steps which are usually taken when
an achievement test is selected. There appear to be three
contexts within which tests are chosen. These three contexts are:

1. Any test may be chosen,

2. One test is clearly the best, and

3. One test is mandated.

All three of these situations have been encountered
in Austin, and we have found that following only the steps necessary
saves everyone time from pursuing nonoptions.

WHAT IS THE RATING SCALE?

Attachment 1 presents the rating scale, titled "Factors
for Rating Achievement Tests." The rating scale has two intended
purposes:

1. To expedite a systematic comparison among achievement
   tests, and

2. To allow a weighting of the factors to be rated
   according to the system's needs.

The rating scale contains three components:

1. Subfactor ratings,

2. Factor ratings, and

3. Overall weighting for the test.

Subfactors are the subsidiary considerations which, taken as a
whole, make up each of the five factors. Subfactors are assigned
ratings by the individual or group using the rating scale on the
following basis:

5 = Adequate

4 = Mostly Adequate

3 = Partly Adequate, Partly Inadequate

2 = Mostly Inadequate

1 = Inadequate

For example, if under the factor of technical soundness,
the reliability of the candidate test is judged by the rater to
be "mostly adequate," the rater would assign a rating of 4 to
this subfactor. The number of subfactors under each of the
factors differs. However, since an arithmetic average of the
subfactor ratings is taken to arrive at the five factor ratings,
each subfactor component is equally represented in the factor
ratings. In cases where the average is a decimal fraction, the
average should be rounded to the nearest whole number. As will
be noted from Attachment 1, several types of reliability should
be considered by the rater. In such cases where the subfactor
is not a unitary concept, the rater may choose to assign ratings
to each of the subcomponents under the subfactor and first
average these, or may simply consider these subcomponents in
making an overall subfactor rating.
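
The averaging step can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the function names and
the sample ratings are hypothetical, and the paper does not say how an
average ending in exactly .5 should be rounded, so the sketch assumes
it rounds up.

```python
import math


def round_half_up(value: float) -> int:
    """Round to the nearest whole number; a .5 rounds up (an assumption)."""
    return math.floor(value + 0.5)


def factor_rating(subfactor_ratings: list[int]) -> int:
    """Average the 1-5 subfactor ratings and round to a whole number."""
    if not subfactor_ratings:
        raise ValueError("at least one subfactor rating is required")
    if any(r < 1 or r > 5 for r in subfactor_ratings):
        raise ValueError("ratings must be between 1 and 5")
    return round_half_up(sum(subfactor_ratings) / len(subfactor_ratings))


# Hypothetical ratings for the four Technical Soundness subfactors:
# reliability, validity, norms, fairness
technical_soundness = factor_rating([4, 3, 5, 4])
print(technical_soundness)  # average is 16 / 4 = 4.0, so the rating is 4
```

Because every subfactor enters the average once, a factor with many
subfactors is not weighted more heavily than a factor with few.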

It should be noted that these subfactor ratings are a
matter of subjective judgment on the part of the rater, based on
the rater's experience and familiarity through study of the test
under consideration. In fact, subfactor ratings will probably
differ from individual to individual and from group
(administrators, teachers, parents, etc.) to group.
Therefore, it is to the benefit of the school district to obtain
multiple ratings of a test for comparison and as a basis for
further study and discussion should the differences be extreme.

However, it is in this connection that the Fatal Flaw
Principle may usefully be employed. If any of the five factor
ratings, made up of the averages of the subfactor ratings, is
less than 3, i.e., mostly inadequate or inadequate, the test
should be dropped from consideration. This is a useful procedure
which helps to narrow the range of candidate tests and which
serves as an anchor for the different rating groups.
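
As a rough sketch of this screening rule, the Python below drops any
candidate with a factor rating below 3. The test names, factor labels,
and ratings are hypothetical and serve only to show the rule at work.

```python
FATAL_FLAW_THRESHOLD = 3  # factor ratings below 3 eliminate the test


def fatal_flaws(factor_ratings: dict[str, int]) -> list[str]:
    """Return the factors rated below the threshold (empty if none)."""
    return [name for name, rating in factor_ratings.items()
            if rating < FATAL_FLAW_THRESHOLD]


# Hypothetical factor ratings for two candidate tests
candidates = {
    "Test A": {"technical": 4, "logistical": 3, "instructional": 4,
               "financial": 2, "interpretational": 5},
    "Test B": {"technical": 3, "logistical": 4, "instructional": 4,
               "financial": 3, "interpretational": 4},
}

# Keep only the candidates with no fatal flaw
remaining = [name for name, ratings in candidates.items()
             if not fatal_flaws(ratings)]
print(remaining)  # ['Test B']
```

Returning the list of flawed factors, rather than a simple yes or no,
gives the rating groups a concrete starting point for discussion.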

The third component of the rating scale, an overall
weighting of each of the five factors, is an additional
mechanism for separating inadequate tests from adequate
tests based on judgments of the importance of each of
the five factors. The overall weightings should be treated as
percentages summing to 100. For example, the logistical
feasibility of a test may be of paramount importance to a
district. In this event, the rater might assign a 60% weight to
this factor and, perhaps, a 10% weight to each of the remaining
four factors. To arrive at the weighted factor ratings, then,
the rater would multiply the factor ratings by the percentage
weights. In the example, if the factor rating for logistical
feasibility was 3, the weighted factor rating would equal 180.
Note, however, that after such weighted factors have been
obtained, the only valid comparisons are between tests, not
between factors on the same test. Should a rater desire to
compare factors on the same test, the unweighted factor ratings
should be used.
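
The weighting arithmetic in this example can be sketched as follows.
The function name and the factor ratings other than logistical
feasibility are hypothetical; the 60/10/10/10/10 weights and the
rating of 3 come from the example above.

```python
def weighted_factor_ratings(ratings: dict[str, int],
                            weights: dict[str, int]) -> dict[str, int]:
    """Multiply each factor rating by its percentage weight."""
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same factors")
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    return {factor: weights[factor] * ratings[factor] for factor in ratings}


# Weights from the example: logistical feasibility is paramount
weights = {"technical": 10, "logistical": 60, "instructional": 10,
           "financial": 10, "interpretational": 10}
# Hypothetical factor ratings, except logistical = 3 as in the example
ratings = {"technical": 4, "logistical": 3, "instructional": 4,
           "financial": 3, "interpretational": 5}

weighted = weighted_factor_ratings(ratings, weights)
print(weighted["logistical"])  # 60 * 3 = 180
```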

WHAT ARE THE FIVE FACTORS?

Technical Soundness

This first factor is the one which initially is of the
most importance to the district testing office, and of lesser
concern to the other consumers. Ultimately, however, it is the
base upon which the whole testing enterprise will rest because,
sooner or later, given the high visibility which test scores have
assumed for an accountability-conscious public, the test which is
chosen must stand up to the critical review of the other
consumers. Therefore, this factor must be given thorough
consideration, even to the extent of disqualifying from the
outset some tests which are apparently attractive from the
standpoint of the other four factors. In fact, as embodied in
the Fatal Flaw Principle, each of the five factors contains
features which may disqualify a test from consideration despite
the test's strengths in other areas. This paper does not purport
to be an exhaustive examination of the technical bases for
evaluating the psychometric properties of a test since these
aspects are covered at length elsewhere. However, the rating
scale should be a workable checklist touching the major
criteria.

Logistical Feasibility

This factor attempts to address those features of a test
concerned with the logistics of the actual administration of the
test. One consideration is the levels of the test which are
available. Is there a level available for each grade to be
tested, or are several grades to be administered the same test
level? In either event, there should be good articulation across
test levels.

A second consideration is whether there are alternate
forms of the test available. Alternate forms are desirable for
several reasons:

1. If the same form of the test is given repeatedly,
   particularly if the same level is given to different
   grades over the course of several years, as was the
   case with AISD's high school achievement test, the
   test takers become familiar with the test merely
   from its repetition. In this event, it is useful to
   have at least one alternate form of the test to
   ensure that reliable test results are obtained.
   In Austin, two forms of the high school
   achievement test were alternated annually at each
   high school. In this way, the test takers were
   presented with a slightly different test in
   alternating years.

2. Familiarity with the test also tends to promote a
   shift in instructional practices toward the content
   of the test. An alternate form of the test helps
   ensure that the instructional focus does not become
   too narrow, but must instead remain sensitive to
   the slight differences from one form of the
   test to the other.

3. In the instance where cheating on the test is
   suspected, an alternate form of the test may
   be administered for comparison purposes.

Functional-level testing may also be a desirable
characteristic of a test.

Another consideration is the amount of time the test
requires for administration. With a premium on
instructional time, it is desirable for the test to
take as little time as possible away from instruction. Hence, a
test whose subtests are shorter translates to fewer days over
which the test must be administered.

Finally, ease of scoring is an important consideration
in terms of a test's logistical feasibility. In smaller school
districts, where scoring is done manually, a test
which is readily hand scored and which quickly generates derived
scores is preferable to a test with complex scoring procedures
which may necessitate sending the answer documents to the test
publisher for scoring, at extra cost to the district and with a
consequent time delay in receiving the scores. In larger school
districts, which have the capability to score the test "in
house," a test with the requisite conversion tables and other
technical documentation readily available is essential.

Instructional Validity 

This is a factor of prime concern, since unless the
test covers the instructional areas in which information
is required, it cannot be valuable, even if it has
excellent technical soundness. In this regard, the match
between the test and the curriculum must be as close as possible.
A test that matches the system's curriculum very well might be a
shoo-in candidate. It should be noted that a perfect match between
a district's curriculum and the content of a commercially published
test is unlikely. Therefore, an alternative for a district is a
locally developed test, which can be constructed to ensure
a closer match. However, this is an expensive and time-consuming
alternative. An additional drawback is the lack of national
norms with which to compare district achievement.

Related to the match with curriculum is the consideration
of the terminology used by the candidate test. If the language
used in the administration directions and in the test items is a
noticeable departure from that customarily used by teachers in
their everyday instruction, some provisions have to be made early
in the school year, well before the test administration, to
acquaint the students with the test's terminology.

Finally, an important consideration, perhaps of
overriding importance, is the utility of the test results for
instruction. Apart from the accountability function of the test
for curriculum decisions, the test must generate useful
information for teachers to use in the instruction of individual
students. To this end, the test must contain information for
districts to create skill profiles of individual student
strengths and weaknesses.



Financial Affordability

This factor is an important and sometimes overlooked
concern. Even if a test is excellent in terms of its technical
soundness, logistical feasibility, and instructional validity, it
must be affordable by the district. Financial considerations include:

1. Start-up costs. The district must make a large
   initial outlay of funds to purchase teachers' guides,
   scoring keys, and the like for annual use. It must
   also purchase reusable booklets, if the test is for
   grade levels above the early primary, for students
   who mark on a separate answer sheet.

2. Annual costs. For students in the early primary
   grades who cannot use a separate answer sheet and
   must mark directly onto the test booklet, the
   district must purchase replacement booklets annually.
   It must also purchase additional teacher guides and
   other testing materials as these become dilapidated
   due to ordinary use or due to the rigors of shipping.

Interpretational Ease

This factor is related to the utility of the test for
generating interpretable results. A variety of scores should be
available, including an overall composite score, scores for each
of the major domains tested, e.g., a total mathematics score, and
scores which can be grouped according to each of the subskills
tested, e.g., number sense. The test should have norming dates
useful for annual achievement comparisons and for major projects
such as Chapter 1. The test battery should include tests in
areas of concern to school districts, including newer areas such
as life skills and computer literacy. Test scores should allow
comparisons both on a longitudinal and a cross-sectional basis.

CONCLUSION

Sometimes we think that a school system should never
change achievement tests, especially if we know the implications
of replacing one test with another. However, once past the
hurdle of committing to a change, learning from the experience of
others can make the selection process much more efficient and
productive. In this paper, we have outlined and described the
following aspects of the process of selecting an achievement test:



1. The factors to consider,

2. The people with the greatest responsibility to ensure
   adequacy for the new test on each factor,

3. The Fatal Flaw Principle, which eliminates a test,

4. The Shoo-In Principle, which selects a single test
   outright, and

5. The context within which a test is selected.

Is it pretentious of us to have laid out such an
elaborate process for others to follow without having first put
it to the test, so to speak? Of course it is. Our purpose, we
must admit, was to document our ideas for ourselves while the
emotions of our recent test changes are still fresh. We do
intend to follow the suggestions of this paper the next time we
change one of our achievement tests. Until then, if any reader
takes our plan to heart, let us know how it works.

Figure 1: WHO ARE THE EXPERTS? Who has major responsibility
to ensure adequacy on each factor for the new
achievement test?

Figure 2: STEPS FOR SELECTING AN ACHIEVEMENT TEST

Attachment 1

FACTORS FOR RATING ACHIEVEMENT TESTS

Test:

Directions: Assign a numerical rating from 1 to 5 to each of the factors
which follow. In this scale:

5 = Adequate
4 = Mostly Adequate
3 = Partly Adequate, Partly Inadequate
2 = Mostly Inadequate
1 = Inadequate

Factor ratings will be based on an average of subfactor ratings.

Ratings (1-5)    Factors

I. Technical Soundness
   (Primary raters: Testing Staff)

   A. Reliability
      1. Test-Retest
      2. Internal Consistency
      3. Correlation with Other Tests/Forms

   B. Validity
      1. Divergent
      2. Factorial
      3. Concurrent
      4. Predictive

   C. Norms
      1. Empirical Norms
      2. Critical Norming Dates
      3. Norm Sample
         a. National Representation
         b. Size of Sample
         c. Subgroup Norms
            1. Urban
            2. Regional
         d. Consistencies
            1. 50th %ile = GE for time of testing
            2. +1.0 GE/year growth at 50th %ile
            3. >1.0 GE/year growth above 50th %ile
            4. <1.0 GE/year growth below 50th %ile
            5. Standard score growth rate logical

   D. Fairness
      1. No Sex Bias
      2. No Ethnic Bias

II. Logistical Feasibility
    (Primary raters: Teachers and Principals, Testing Staff)
    (Secondary raters: Central Administration)

    A. Levels Available per Grade Span

    B. Alternate Forms Available

    C. Out-of-Level Testing
       1. Booklet Adaptations
       2. Administration Differences Across Levels

    D. Testing Schedule
       1. Days
       2. Time per Test

    E. Ease of Scoring
       1. Manual Scoring
       2. Publisher's Scoring Service
       3. District's Machine Scoring

    F. Format
       1. Physical Layout of Items
       2. Print
       3. Graphics and Illustration

    G. Clarity of Directions
       1. For Test Administrators
       2. For Students

    H. Availability of Student Practice Materials

    I. Training Requirements for Administrators

    J. Other Language Editions

    K. Editions for the Handicapped



III. Instructional Validity
     (Primary raters: Teachers and Principals)
     (Secondary raters: Testing Staff, Central Administration)

     A. Areas Tested

     B. Match with Curriculum
        1. Content
        2. Skills Levels
        3. Terminology

     C. Utility of Test




IV. Financial Affordability
    (Primary raters: Central Administration, Board of Trustees)
    (Secondary raters: Testing Staff)

    A. Start-Up Costs
       1. Reusable Booklets
       2. Teacher Guides

    B. Annual Costs
       1. Disposable Booklets
       2. Answer Sheets
       3. Scoring
       4. Reporting

    C. Adjusted Annual Per Pupil Cost:

       [(Start-Up Costs) + (Annual Costs x Years Life Expectancy
       of Test)] / Years Life Expectancy of Test
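
The adjusted annual cost under item C can be sketched in Python. The
sketch assumes the formula reads as start-up costs plus annual costs
times the years of life expectancy, all divided by the years of life
expectancy, which is the reading the garbled original most plausibly
supports. The function name and the dollar figures are hypothetical,
and the inputs are assumed to be per-pupil amounts.

```python
def adjusted_annual_cost(start_up_cost: float,
                         annual_cost: float,
                         life_years: int) -> float:
    """Spread start-up costs over the test's expected years of use."""
    if life_years <= 0:
        raise ValueError("life expectancy must be at least one year")
    return (start_up_cost + annual_cost * life_years) / life_years


# Hypothetical per-pupil figures: $5.00 start-up, $1.50 per year, 5 years
print(adjusted_annual_cost(5.00, 1.50, 5))  # (5.00 + 7.50) / 5 = 2.5
```

A longer life expectancy lowers the adjusted figure, since the
start-up outlay is spread over more years.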

V. Interpretational Ease
   (Primary raters: Parents, Teachers and Principals,
    Testing Staff, Central Administration,
    Board of Trustees)

   A. Scores Available
      1. Overall Composite
      2. Major Domains
      3. Skills Areas

   B. Norming Date

   C. Test Areas

   D. Comparability
      1. Past Scores
      2. Other Districts/Groups



Test Rating Summary

Factor    Weight  x  Rating  =  Total
I
II
III
IV
V
          100                   Overall Rating

Weight: Divide 100 up among the five factors (for example:
10, 30, 20, 10, 30) to represent the relative importance
of each in the decision.
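
The summary arithmetic can be sketched as follows, using the example
weights from the form. The function name and both sets of factor
ratings are hypothetical; the sketch assumes that the Overall Rating
is the sum of the five weight-times-rating totals.

```python
def overall_rating(weights: list[int], ratings: list[int]) -> int:
    """Sum weight x rating over the five factors, I through V."""
    if len(weights) != 5 or len(ratings) != 5:
        raise ValueError("exactly five weights and five ratings are required")
    if sum(weights) != 100:
        raise ValueError("weights must sum to 100")
    return sum(w * r for w, r in zip(weights, ratings))


weights = [10, 30, 20, 10, 30]  # example weights from the form
test_a = [4, 3, 4, 3, 5]        # hypothetical ratings, factors I to V
test_b = [5, 4, 3, 4, 3]        # hypothetical ratings, factors I to V

print(overall_rating(weights, test_a))  # 40 + 90 + 80 + 30 + 150 = 390
print(overall_rating(weights, test_b))  # 50 + 120 + 60 + 40 + 90 = 360
```

With these weights Test A rates higher overall, even though Test B has
the better ratings on factors I, II, and IV.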